This homework is due Sunday February 14, 2016 at 11:59 PM. When complete, submit your code in the R Markdown file and the knitted HTML file on Canvas.

Background

This HW is based on Hans Rosling talks New Insights on Poverty and The Best Stats You’ve Ever Seen.

The assignment uses data to answer specific question about global health and economics. The data contradicts commonly held preconceived notions. For example, Hans Rosling starts his talk by asking: (paraphrased) “for each of the six pairs of countries below, which country do you think had the highest child mortality in 2015?”

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa

Most people get them wrong. Why is this? In part it is due to our preconceived notion that the world is divided into two groups: the Western world versus the third world, characterized by “long life,small family” and “short life, large family” respectively. In this homework we will use data visualization to gain insights on this topic.

Problem 1

The first step in our analysis is to download and organize the data. The necessary data to answer these question is available on the gapminder website.

Problem 1.1

We will use the following datasets:

  1. Childhood mortality
  2. Life expectancy
  3. Fertility
  4. Population
  5. Total GDP

Create five tbl_df table objects, one for each of the tables provided in the above files. Hints: Use the read_csv function. Because these are only temporary files, give them short names.

pop <- read.csv("indicator gapminder population - Data.csv")
u5Mort <- read.csv("indicator gapminder under5mortality - Data.csv")
gdp <- read.csv("indicator GDP at market prices, constant 2000 US$ (WB estimates) .xls - Data.csv")
life_exp <- read.csv("indicator life_expectancy_at_birth - Data.csv")
fertile <- read.csv("indicator undata total_fertility - Data.csv")

Problem 1.2

Write a function called my_func that takes a table as an argument and returns the column name. For each of the five tables, what is the name of the column containing the country names? Print out the tables or look at them with View to determine the column.

my_func <- function(x){
  return(names(x))
}
my_func(pop)
my_func(u5Mort)
my_func(gdp)
my_func(life_exp)
my_func(fertile)

The column containing the country name in each data set is as follows: “Total.population” in the population data set; “Under.five.mortality” in the under 5 mortality data set; “GDP..constant.2000.US..” in the GDP at market prices data set; “Life.expectancy.with.projections..Yellow.is.IHME” in the life expectancy at birth data set; and “Total.fertility.rate” in the total fertility data set.

Problem 1.3

In the previous problem we noted that gapminder is inconsistent in naming their country column. Fix this by assigning a common name to this column in the various tables.

colnames(pop)[1] <- "country"
colnames(u5Mort)[1] <- "country"
colnames(gdp)[1] <- "country"
colnames(life_exp)[1] <- "country"
colnames(fertile)[1] <- "country"

Problem 1.4

Notice that in these tables, years are represented by columns. We want to create a tidy dataset in which each row is a unit or observation and our 5 values of interest, including the year for that unit, are in the columns. The unit here is a country/year pair and each unit gets values:

head(pop)
head(u5Mort)
head(gdp)
head(life_exp)
head(fertile)

We call this the long format. Use the gather function from the tidyr package to create a new table for childhood mortality using the long format. Call the new columns year and child_mortality

u5Mort <- u5Mort %>% gather(year, child_mortality, -country)

Now redefine the remaining tables in this way.

pop <- pop %>% gather(year, population, -country)
gdp <- gdp %>% gather(year, gdp, -country)
life_exp <- life_exp %>% gather(year, life_exp, -country)
fertile <- fertile %>% gather(year, fertility, -country)

Problem 1.5

Now we want to join all these files together. Make one consolidated table containing all the columns

joinBy <- c("country", "year")
master <- full_join(u5Mort, gdp, by = joinBy)
master <- full_join(master, pop, by = joinBy)
master <- full_join(master, fertile, by = joinBy)
master <- full_join(master, life_exp, by = joinBy)

Problem 1.6

Add a column to the consolidated table containing the continent for each country. Hint: We have created a file that maps countries to continents here. Hint: Learn to use the left_join function.

continents <- read.table("continent-info.tsv", sep = "\t", quote="\"")
names(continents) <- c("country", "continent")
#Clean continents data
continents$continent <- trimws(continents$continent)
master <- left_join(master, continents)
## Joining by: "country"
#Clean the Data
master$year <- gsub("X", "", master$year)
master$year <- as.numeric(master$year)
master$pop <- gsub(",", "", master$pop)
master$pop <- as.numeric(as.character(master$pop))

Problem 2

Report the child mortalilty rate in 2015 for these 5 pairs:

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa
#1. Sri Lanka or Turkey
subset(master$child_mortality, master$country == "Sri Lanka" & master$year == 2015)
## [1] 8.7
subset(master$child_mortality, master$country == "Turkey" & master$year == 2015)
## [1] 13.5
#2. Poland or South Korea
subset(master$child_mortality, master$country == "Poland" & master$year == 2015)
## [1] 5.2
subset(master$child_mortality, master$country == "South Korea" & master$year == 2015)
## [1] 3.5
#3. Malaysia or Russia
subset(master$child_mortality, master$country == "Malaysia" & master$year == 2015)
## [1] 8.2
subset(master$child_mortality, master$country == "Russia" & master$year == 2015)
## [1] 9.6
#4. Pakistan or Vietnam
subset(master$child_mortality, master$country == "Pakistan" & master$year == 2015)
## [1] 81.1
subset(master$child_mortality, master$country == "Vietnam" & master$year == 2015)
## [1] 21.7
#5. Thailand or South Africa
subset(master$child_mortality, master$country == "Thailand" & master$year == 2015)
## [1] 12.3
subset(master$child_mortality, master$country == "South Africa" & master$year == 2015)
## [1] 42.1

Based on these examples, we see that the child mortality rates in the countries that are generally considered “poorer”, there is not a consistent correlation to higher child mortality rates compared to countries with higher GDPs. This suggest that the issue of child mortality is more complicated than simply being a function of a country’s total wealth.

Problem 3

To examine if in fact there was a long-life-in-a-small-family and short-life-in-a-large-family dichotomy, we will visualize the average number of children per family (fertility) and the life expectancy for each country.

Problem 3.1

Use ggplot2 to create a plot of life expectancy versus fertiltiy for 1962 for Africa, Asia, Europe, and the Americas. Use color to denote continent and point size to denote population size:

p <- ggplot(filter(master, year == 1962, continent != "Oceania"), aes(x=fertility, y = life_exp))
p + geom_point(aes(size = pop, color=continent)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth")

Do you see a dichotomy? Explain.

There does appear to be a dicotomy as the bottom left quadrant of the graph is largely empty. This suggests that there may be distinct patterns in life expectancy going on at different fertility rates. We also notice that the points are clustered by continent. Europe is clearly clustered near the top left portion of the graph with low fertility rates and high life expectancy, while Africa is clustered near the bottom left with high fertility rates and low life expectancy.

Problem 3.2

Now we will annotate the plot to show different types of countries.

Learn about OECD and OPEC. Add a couple of columns to your consolidated tables containing a logical vector that tells if a country is OECD and OPEC respectively. It is ok to base membership on 2015.

# OECD Countries
oecd <- read.csv("OECD.csv")
#OPEC Countries
opec <- read.csv("OPEC.csv")
#Merge OPEC and OECD with Master
master <- left_join(master, oecd)
## Joining by: "country"
master <- left_join(master, opec)
## Joining by: "country"

Problem 3.3

Make the same plot as in Problem 3.1, but this time use color to annotate the OECD countries and OPEC countries. For countries that are not part of these two organization annotate if they are from Africa, Asia, or the Americas.

# Consolidate groups or continents into a singe variable called "memberOf"
master$memberOf <- master$continent
master$memberOf[master$opec == "opec"] <- "OPEC"
master$memberOf[master$oecd == "oecd"] <- "OECD"
#Plot the data
p <- ggplot(filter(master, year == 1962, continent != "Oceania"), aes(x=fertility, y = life_exp))
p + geom_point(aes(size = pop, color=memberOf)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth")

How would you describe the dichotomy? When we graph the data explicitly identifying the OECD and OPEC countries, we can see that there is generally still a distinct dichotomy where the OECD countries are cluster in the top-left quadrant representing low fertility rates and high life expectancies, while OPEC countries see higher fertility rates, thought with perhaps slightly higher life expectancies than non-OPEC African countries. The dichotomy though is not as distinct as the purely continent-based dichotomy as there are a hand full of OECD countries intermingled with OPEC countries. Still, the two groups generally cluster together.

Problem 3.4

Explore how this figure changes across time. Show us 4 figures that demonstrate how this figure changes through time.

# Life Expectancy and Fertility Rates 1972
p <- ggplot(filter(master, year == 1972, continent != "Oceania"), aes(x=fertility, y = life_exp))
p + geom_point(aes(size = pop, color=memberOf)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth")

# Life Expectancy and Fertility Rates 1982
p <- ggplot(filter(master, year == 1982, continent != "Oceania"), aes(x=fertility, y = life_exp))
p + geom_point(aes(size = pop, color=memberOf)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth")

# Life Expectancy and Fertility Rates 1992
p <- ggplot(filter(master, year == 1992, continent != "Oceania"), aes(x=fertility, y = life_exp))
p + geom_point(aes(size = pop, color=memberOf)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth")

# Life Expectancy and Fertility Rates 2002
p <- ggplot(filter(master, year == 2002, continent != "Oceania"), aes(x=fertility, y = life_exp))
p + geom_point(aes(size = pop, color=memberOf)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth")

# Life Expectancy and Fertility Rates 2012
p <- ggplot(filter(master, year == 2002, continent != "Oceania"), aes(x=fertility, y = life_exp))
p + geom_point(aes(size = pop, color=memberOf)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth")

Would you say that the same dichotomy exists today? Explain:

Over the time period from 1962-2012, the general trent of the data begins to take a more linear shape and move generally in the direction from the bottom-right quadrant to the top left quadrant, marking a shift from high fertility rates and low life expectancy to lower fertility rates and higer life expectancy. As in 1962, those countries with high fertility rates and low life expectancy continue to be generally African nations while Europe and the OECD countries remain clustered at the top. With respect to the dichotomy we saw in earlier years, it no longer appears to be the case that the slope of the approximation line is different between the OECD and the OPEC nations. However, there is a wide spread among just the OPEC nations with respect to fertility and life expectancy with some countries intermingled with OECD countries in the low-fertility, high life expectancy quandrant, and others still mired in the high-fertility, low life-expectancy quadrant.

Problem 3.5 (Optional)

Make an animation with the gganimate package.

library(gganimate)
p <- ggplot(filter(master,  continent != "Oceania", year >=1950), aes(x=fertility, y = life_exp, frame = year))
p <- p + geom_point(aes(size = pop, color=memberOf)) +
  ggtitle("Life Expectancy and Fertility Rates") +
  labs(x="Fertility Rate", y = "Life Expectancy at Birth", legend.title = "Country Membership")
gg_animate(p) 

Problem 4

Having time as a third dimension made it somewhat difficult to see specific country trends. Let’s now focus on specific countries.

Problem 4.1

Let’s compare France and its former colony Tunisia. Make a plot of fertility versus year with color denoting the country. Do the same for life expecancy. How would you compare Tunisia’s improvement compared to France’s in the past 60 years? Hint: use geom_line

#Plot of France and Tunisia Fertility Rates over Time
master %>% 
    filter(country %in% c("France", "Tunisia")) %>%
    ggplot(aes(x = year, y = fertility, color = country)) + 
    geom_line() +
    ggtitle("Fertility Rates Over Time") +
    labs(x="Year", y = "Fertility Rate", color = "Country")  

#Plot of France and Tunisia Life Expectancy over Time
master %>% 
    filter(country %in% c("France", "Tunisia")) %>%
    ggplot(aes(x = year, y = life_exp, color = country)) + 
    geom_line() +
    ggtitle("Life Expectancy Over Time") +
    labs(x="Year", y = "Life Expectancy", color = "Country")  

With respect to fertile rates, France’s fertility rates historicall were much lower than Tunisia’s until sometime in the late 1960’s or early 1970’s when fertility rates in Tunisia peaked and and dropped precipitously over the course of the next 30 years until finally evening out with France over the last decade or so near 2.

France’s life expectancy has always been higher than Tunisia’s though perhaps with some greater variation, perhaps due to more detailed reporting. On average, France saw a higher rate of increase in its life expectancy from the beginning of the data in 1800 through just before World War I. The effect of the two world wars can be clearly seen on France’s life expectancy curve over time. Tunisia begins to dramatically improve its life expectancy beginning perhaps in the mid 1920s until flatening out over the last 15 years or so, approaching 80 and converging with France’s life expectancy.

Problem 4.2

Do the same, but this time compare Vietnam to the OECD countries.

#Plot of Vietnam and OECD Fertility Rates over Time
master %>% 
    filter(country == "Vietnam" | memberOf == "OECD") %>%
    ggplot(aes(x = year, y = fertility, color = memberOf)) + 
    geom_line() +
    ggtitle("Fertility Rates Over Time") +
    labs(x="Year", y = "Fertility Rate",color = "Country Membership")  

master %>% 
    filter(country == "Vietnam" | memberOf == "OECD") %>%
    ggplot(aes(x = year, y = fertility, color = country)) + 
    geom_line() +
    ggtitle("Fertility Rates Over Time") +
    labs(x="Year", y = "Fertility Rate", color = "Country")  

#Plot of Vietnam and OECD Life Expectancy over Time
master %>% 
    filter(country == "Vietnam" | memberOf == "OECD") %>%
    ggplot(aes(x = year, y = life_exp, color = memberOf)) + 
    geom_line() +
    ggtitle("Life Expectancy Over Time") +
    labs(x="Year", y = "Life Expectancy", color = "Country Membership") 

master %>% 
    filter(country == "Vietnam" | memberOf == "OECD") %>%
    ggplot(aes(x = year, y = life_exp, color = country)) + 
    geom_line() +
    ggtitle("Life Expectancy Over Time") +
    labs(x="Year", y = "Life Expectancy", color = "Country")  

When we compare Vietnam to the OECD countries, we see that for most of history (dating back at least to 1800), Vietnam’s fertility rate was not dissimilar from many of the OECD coutries. Based on the two fertility graphs, we can see that Vietnam’s fertility rate falls later than most of the OECD countires, though not all OECD countries. That is to say that while it is one of the last countries to see its fertility rate drop, it is not an outlier based on the data from the other OECD countries.

The same can be said of life expectancy estimates in Vietnam compared to the OECD countries. The life expectancy in Vietnam is near the bottom of the pack with respect to life expectancy over time, though it over-performs some OECD countries for most of the time period (1800-2015).

Problem 5

We are now going to examine GDP per capita per day.

Problem 5.1

Create a smooth density estimate of the distribution of GDP per capita per day across countries in 1970. Include OECD, OPEC, Asia, Africa, and the Americas in the computation. When doing this we want to weigh countries with larger populations more. We can do this using the “weight”" argument in geom_density.

# Create GDP per capita per day data
master <- 
  master %>%
    mutate(GDPperCapitaPerDay = (gdp / pop) / 365)

# GDP per Capita per Day (Density).
master %>%
  filter(year == 1970, continent != "Oceania") %>%
  ggplot(aes(x = GDPperCapitaPerDay, weight = pop)) +
  geom_density() +
  ggtitle("GDP per Capita per Day in 1970 (Density)") +
  labs(x = "GDP per Capita per Day ($)", y = "Population Weighted Density")

We can see in this graphic that a large portion of the world population has a very GDP per Capita per Day. There are then two other humps further along the distribution representing wealthier clusters.

Problem 5.2

Now do the same but show each of the five groups separately.

# GDP per Capita per Day for Member Groups
master %>%
  filter(year == 1970, continent != "Oceania") %>%
  ggplot(aes(x = GDPperCapitaPerDay, weight = pop, color = memberOf)) +
  geom_density() +
  facet_grid(~ memberOf) +
  ggtitle("GDP per Capita per Day in 1970 (Density)") +
  labs(x = "GDP per Capita per Day ($)", y = "Population Weighted Density")

We can see from this graphic that the bulk of the low GDP per Capita per Day population is located in Asia, likely driven by the large populations and extreme poverty in countries like India and China. There is also a spike near zero in Africa. We see that the two other clusters were driven by a bi-modal distribution of GDP per capita in the OECD.

Problem 5.3

Visualize these densities for several years. Show a couple of of them. Summarize how the distribution has changed through the years.

# GDP per Capita per Day for Member Groups (1970)
master %>%
  filter(year == 1970, continent != "Oceania") %>%
  ggplot(aes(x = GDPperCapitaPerDay, weight = pop, color = memberOf)) +
  geom_density() +
  facet_grid(~ memberOf) +
  ggtitle("GDP per Capita per Day in 1970 (Density)") +
  labs(x = "GDP per Capita per Day ($)", y = "Population Weighted Density", color = "Country Membership")

# GDP per Capita per Day for Member Groups (1980)
master %>%
  filter(year == 1980, continent != "Oceania") %>%
  ggplot(aes(x = GDPperCapitaPerDay, weight = pop, color = memberOf)) +
  geom_density() +
  facet_grid(~ memberOf) +
  ggtitle("GDP per Capita per Day in 1980 (Density)") +
  labs(x = "GDP per Capita per Day ($)", y = "Population Weighted Density", color = "Country Membership")

# GDP per Capita per Day for Member Groups (1990)
master %>%
  filter(year == 1990, continent != "Oceania") %>%
  ggplot(aes(x = GDPperCapitaPerDay, weight = pop, color = memberOf)) +
  geom_density() +
  facet_grid(~ memberOf) +
  ggtitle("GDP per Capita per Day in 1990 (Density)") +
  labs(x = "GDP per Capita per Day ($)", y = "Population Weighted Density", color = "Country Membership")

# GDP per Capita per Day for Member Groups (2000)
master %>%
  filter(year == 2000, continent != "Oceania") %>%
  ggplot(aes(x = GDPperCapitaPerDay, weight = pop, color = memberOf)) +
  geom_density() +
  facet_grid(~ memberOf) +
  ggtitle("GDP per Capita per Day in 2000 (Density)") +
  labs(x = "GDP per Capita per Day ($)", y = "Population Weighted Density", color = "Country Membership")

# GDP per Capita per Day for Member Groups (2010)
master %>%
  filter(year == 2010, continent != "Oceania") %>%
  ggplot(aes(x = GDPperCapitaPerDay, weight = pop, color = memberOf)) +
  geom_density() +
  facet_grid(~ memberOf) +
  ggtitle("GDP per Capita per Day in 2010 (Density)") +
  labs(x = "GDP per Capita per Day ($)", y = "Population Weighted Density", color = "Country Membership")

We can see from the graphs that the wealth in the OECD countries as increased, but has too wealth in Asia. Although the region certaily represents the bulk of the low income population, we can see that the distribution has widened, indicating that some people in those countries have been made better off over the last 50 years. We see a widening of distributions across all regions indicating a similar pattern.